Sparse Covariance Selection via 
Robust Maximum Likelihood Estimation 

Onureena Banerjee* Alexandre d' Aspremontt Laurent El Ghaoui* 

February 1, 2008 

Abstract 

We address a problem of covariance selection, where we seek a trade-off between a high likelihood against the 
number of non-zero elements in the inverse covariance matrix. We solve a maximum likelihood problem with a 
penalty term given by the sum of absolute values of the elements of the inverse covariance matrix, and allow for 
imposing bounds on the condition number of the solution. The problem is directly amenable to now standard interior- 
point algorithms for convex optimization, but remains challenging due to its size. We first give some results on the 
theoretical computational complexity of the problem, by showing that a recent methodology for non-smooth convex 
optimization due to Nesterov can be applied to this problem, to greatly improve on the complexity estimate given by 
interior-point algorithms. We then examine two practical algorithms aimed at solving large-scale, noisy (hence dense) 
instances: one is based on a block-coordinate descent approach, where columns and rows are updated sequentially, 
another applies a dual version of Nesterov's method. 

1 Introduction 

Consider a data set with n variables, drawn from a multivariate Gaussian distribution Af(0, E), where the covariance 
matrix £ is unknown. When the number of variables n is large, estimating the entries of £ becomes a significant 
problem. 

In 1972, Dempster |Dem72| suggested reducing the number of parameters to be estimated by setting to zero some 
elements of the inverse covariance matrix £ _1 . This idea, known as covariance selection, can lead to a more robust 
estimate of £ if enough entries in its inverse are set to zero. Furthermore, conditional independence properties of 
the distribution are determined by the locations of zeros in S -1 . Hence the approach can be used to simultaneously 
determine a robust estimate of the covariance matrix and, perhaps more importantly, discover structure, namely con- 
ditional independence properties, in the underlying graphical model, which is a useful information in its own right. 
Specific applications of covariance selection include speech recognition |Bil99 CG99 BilOO | and gene expression 
data a nalysis IDW04IIDHJ+04I . 

In | BilOO |, Bilmes proposed a method for covariance selection based on choosing statistical dependencies accord- 
ing to conditional mutual information computed using training data. Other recent work involves identifying those 
Gaussian graphical models that are best supported by the data and any available prior information on the covariance 
matrix. This approach is used by lJCD + 04l l&W04 1 and is applied to gene expression data. Recently, | HLP051 I&RV05 1 
considered penalized maximum likelihood estimation and | DRV05 1 in particular propose a set of large scale methods 
to solve sparse problems. 

In this paper we focus on the problem of computing a sparse estimate of the covariance matrix using only a 
large-scale, a priori dense and noisy sample covariance matrix E. Our approach is based on l\ -penalized maximum 
likelihood, and can be interpreted as a "robust maximum likelihood" method, where we assume that the true covariance 
matrix is within a component-wise bounded perturbation of the sample one, and the estimate is chosen to maximize 
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the worst-case (minimal) likelihood. One of our goals is to provide an efficient algorithm to discover structure, rather 
than solve problems where the inverse covariance matrix has an already known sparse structure, as in |DRV05 1. 

Our contributions are as follows: we specify the problem and outline some of its basic properties (section[2); we 
describe how one can apply a recent methodology for convex optimization due to Nesterov |Nes03|, and obtain as 
a result a computational complexity estimate that has a much better dependence on problem size than interior-point 
algorithms (section[3j; we present two algorithms for solving large dense problems (section^: a version of Nesterov's 
method applied to the dual problem, and a block coordinate descent method. In section [5] we present the results of 
some numerical experiments comparing these two algorithms. 

2 Preliminaries 

Problem setup. For a given covariance matrix E e S", and reals numbers < a < (3, our problem is formulated 

as 

p* := max{logdetX- (E,X) - p\\X\\! : al n < X < f3I n } (1) 

with variable X E S", and \\X\\i := ^Z",- =1 \Xij\. The parameter p > controls the size of the penalty, hence the 
sparsity of the solution. Here (E, X) = Tr Y.X denotes the scalar product between the two symmetric matrices E, X. 
The penalty term involving the sum of absolute values of the entries of X is a proxy for the number of its non-zero 
elements, and is often used in regression techniques, such as LASSO in |Tib96|, when sparsity of the solution is a 
concern. In our model, the bounds (a, (3) on the eigenvalues of X are fixed, and user-chosen; although we allow 
a = 0, j3 = +oo in our model, such bounds are useful in practice to control the condition number of the solution. 

For p = 0, and provided S >~ 0, problem Q has a unique solution X* — E _1 , and the corresponding maximum- 
likelihood estimate £ = £. Due to noise in the data, in practice, the sample estimate E may not have a sparse 
inverse, even if the underlying graphical model exhibits conditional independence properties; by striking a trade-off 
between maximality of the likelihood and number of non-zero elements in the inverse covariance matrix, our approach 
is potentially useful at discovering structure, precisely conditional independence properties in the data. At the same 
time, it serves as a regularization technique: when E is rank-deficient, there is no well-defined maximum-likelihood 
estimate, whereas it can be shown that the solution to problem ([0 is always unique and well-defined for p > 0, even 
if a = and/or (3 = oo. 

Robustness, duality, and bounds. In the case when a = 0, (3 = oo, we write Q as 

max min log det X ~ (X, E + U) , 

xyo ||t/||oo<p 

where Ht/Hoo denotes the maximal absolute value of the entries of U. This corresponds to seeking an estimate with 
maximal worst-case likelihood, over all componentwise bounded additive perturbations E+J7 of the sample covariance 
matrix E. Such a "robust optimization" interpretation can be given to a number of estimation problems, most notably 
support vector machines for classification. 

The above leads to the following equivalent, dual problem (with a = 0, (3 = oo): 

d* :=min - logdet(£ + £/) - n : ||C/||oo < p, T, + Uy0. (2) 

Note that the diagonal elements of an optimal U are simply Ua = p. The corresponding covariance matrix estimate is 
E := E + U . Since the above dual problem has a compact feasible set, both primal and dual problems are equivalent. 

In the case when a = 0, (3 = oo, we can derive finite, a priori bounds on the condition number of the solution. 
Indeed, we can show that we can always assume a(n)I n < X < (3{n)l n , where a(n) := 1/(||E|| + np) and (3(n) = 
n/p. These bounds guarantee strict convexity of the objective, as well as existence, boundedness and uniqueness of 
the solution when p > 0, even if no bounds are set a priori. 
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3 Complexity 



First- vs. second-order methods. Of course, problem Q is convex and can readily be solved using interior point 
methods (see |BV04| for example). However, such second-order methods become quickly impractical for solving ([0, 
since the corresponding complexity to compute an e-suboptimal solution is 0(n 6 log(l/e)). The authors of |DRV05| 
developed interior-point algorithms to solve a problem related to Q, where a (sparse) structure of the solution is known 
a priori. Here our focus is on relatively large, dense problems, for which a solution of moderate accuracy is enough. 
Note that we cannot expect to do better than 0(n 3 ), which is the cost of solving the non-penalized problem for dense 
covariance matrices S. The recently-developed first-order algorithms due to |Nes03 1 trade-off a better dependence on 
problem size against a worse dependence on accuracy, usually 1/e instead of its logarithm. The method we describe 
next has a complexity of 0(n 4 5 /e). This is a substantial improvement over interior-point methods when e is not too 
small, which is often the case in practice. In addition, the memory space requirement of the first-order method is 
much lower than that of interior-point methods, which involve storing a dense Hessian, and hence become quickly 
prohibitive with a problem having 0(n 2 ) variables. 



Nesterov's format. We can write (0 in the format given in I Nes03l : 



min max f(X) + (A(X), U) = min f(X), 
xeQi ueQ 2 JfeQi 



where we define f(X) = -logdetX + (E,X), A = pl n 2, and Q 1 := {X e S n : al n d X -< /3I„}, Q 2 := 
{UeS n : HC/IU < 1}. 



Prox-functions and related parameters. To Qi and Q2 we now associate norms and so-called prox-functions. For 
Qi, we use the Frobenius norm, and a prox-function 

d x {X) = -logdetX + log/3. 

The function cfi is strongly convex on Q 1 , with a convexity parameter ofcri = 1 //3 2 , in the sense that V 2 di(X)[H,H] - 
Tr(X~ 1 HX~ 1 H) > (3~ 2 \\H\\p for every H. Furthermore, the center of the set, Xq := axgminxeQi di(X) is 
Xq = (3I n , and satisfies di(Xo) = 0. With our choice, we have D\ := max^gg! di(X) = n log(/3/a). 

To Q2, we also associate the Frobenius norm, and the prox-function d2(U) = ||t/|||./2. With this choice, the 
center Uq of Q2 is Uo = 0. Furthermore, the function is strictly convex on its domain, with convexity parameter 
with respect to the 1-norm <j\ — 1, and we have D2 := max;7eQ 2 d2(U) = n 2 /2. 

The function / has a gradient that is Lipschitz-continuous with respect to the Frobenius norm on the set Qi, with 
Lipschitz constant M = l/a 2 . Finally, the norm (induced by the Frobenius norm) of the operator A is \\A\\ = p. 



Idea of the method. The method is based on replacing the objective of the original problem, f(X), with f e (X), 
where e > is the given desired accuracy, and f e is the penalized function involving the prox-function c^: 

f e (X) := f{X) + max (X, U) - (e/2D 2 )d 2 (U). (3) 

UE Q.1 

The above function turns out to be a smooth uniform approximation to / everywhere, with maximal error e/2. Fur- 
thermore, the function f e is Lipschitz-continuous, with Lipschitz constant given by L(e) := M + D2\\A\\ 2 / (2a2e). A 
specific gradient algorithm for smooth, constrained convex minimization is then applied to the smooth convex func- 
tion f c , with convergence rate in 0(\J L(e)/e). Specifically, the algorithm is guaranteed to produce an e-suboptimal 
solution after a number of steps not exceeding 



N{e) :=M\A\\ X + J = v & ' {Anap+^l). (4) 

<J\02 £ V ^l 6 e 
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Nesterov's algorithm. Choose e > and set Xq — f3I n , and proceed as follows. 
For k = 0, ... , N(e) do 

1. Compute Vf e (X k ) = -A^ 1 + £ + U*(X k ), where U*(X) solves 0. 

2. FindY k =axgmm Y {{Vf e {X k ),Y-X k ) + $L(e)\\Y-X k \\ 2 F : Y e Qi}. 

3. FindZ fe =argmin x {^di(X) + E- =0 ^<V/e(^i),^-^i) : XeQi[ 

4. Update** = ^ fc + |±ly- fc . 

Complexity estimate. For step 1, the gradient of f e is readily computed in closed form, via the computation of 
the inverse of X. Step 2 essentially amounts to projecting on Qi, and requires an eigenvalue problem to be solved; 
likewise for step 3. In fact, each iteration costs 0(n 3 ). The number of iterations necessary to achieve an objective 
with absolute accuracy less than e is given in ® by N(e) = 0(n/e) (if p > 0). Thus, the overall complexity when 
p > is 0(n 4 5 /e), as claimed. 

4 Algorithms 

The algorithms presented next address problem Q in the case when no bounds are given a priori, that is, a = 
and j3 — oo. While they do not share all the nice theoretical computational complexity properties of the algorithm 
presented earlier, they seem to work well in practice. We are currently implementing the algorithm in section [3] and 
will include it in our experiments in a future version of this paper. 

A dual version of Nesterov's method. In our experiments, we have used a preliminary version of Nesterov's method 
with o = and (3 = oo that, instead of working on the primal Q, addresses the dual 0. The algorithm is essentially 
the same as that presented in section|3] with the following setup. Define Qi := {x : U € S n : ||E/||oo < 1}> 

Q 2 = {U eS™ : Tr U < n/p}, and 

f(x)=max(A(X),U)-(j>(U), A = -I n 2, cj>(u) — — logdet U + (£, U). 
ueQ2 

To simplify the projections required by the algorithm, we make the assumption that p is small enough to ensure that 
Qi is entirely included in the cone of positive semidefinite matrices. In practice, we have found that the algorithm 
does well, even when this condition does not hold. As in section[3] each step of this algorithm costs 0(n 3 ) operations. 

To Qi, we associate the Frobenius norm in R nx ™ ? and a prox-function defined for x € Q\ by d\(X) = 
(1/2)||X|||,; the center of Qi with respect to d\ is Xq = 0, and D\ = m&xxeQi di(X) — p 2 n 2 /2. Furthermore, the 
function di is strictly convex on its domain, with convexity parameter with respect to the Frobenius norm o\ = 1. For 
Q2 we use the dual of the standard matrix norm (denoted || • |||), and a prox-function d2(U) = Tr(U\ogU) +logn, 
where log refers to the matrix (and not componentwise) logarithm. The center of the set Q2 is Uq = n _1 I n , and 
^2(^0) = 0. We have max ue g 2 d2(u) < \ogn := Z?2- The convexity parameter of on its domain with respect to 
|| ■ || 2, is bounded below by 02 — 1/2 [BTN04|. Finally, the norm of the operator A, with respect to the Frobenius 
norm and the dual of the standard norm, is 1. 

Dual block-coordinate descent. We now describe an algorithm based on block-coordinate descent, which solves 
the dual problem (0 by optimizing over one column/row pair at a time. For simplicity, we consider the case when 
a = and /3 = 00, which corresponds to providing no a priori bounds on the condition number of the solution. 

The algorithm is initialized with the estimate £ = £ + pi y 0, since the optimal diagonal values of U in 10 
are simply p. To update £, we solve Q where all elements in U, except the off-diagonal values of one column (and 
corresponding row), are fixed. Using a permutation of rows and columns, we can always express the problem as one 
of optimizing over the last column/row pair: 

min logdet f^ 11 % + U ) : |d < p, (5) 
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where inequalities are understood componentwise, and the current estimate £ of the covariance matrix is partitioned 
into four blocks, with En the upper-left (n — 1) x (n — 1) block, and (£12, 022) the last column. The sub-problem 
Q reduces to a simple box-constrained quadratic program. Specifically, the corresponding column/row £12 of the 
current estimate £ is updated with £i 2 + u* , where 



argmin v^Y,^" 



(6) 



The stopping criterion we can use involves checking if the primal-dual gap of the problem (|2jl is less than a given 
tolerance e, which translates as (£, X) + p||X||i < n + e, where X — C~ l . In practice though, we have found that a 
fixed number of sweeps (say, K = 4) through all column/row pairs is usually enough. 

It can be proven that the algorithm converges, however we do not know the complexity of the algorithm. A simple 
modification of it, where we impose a condition number constraint on the estimate £ at each column/row update, 
can be shown to converge in 0(n 6 ). Hence, the dual BCD algorithm is not really competitive from the theoretical 
complexity point of view, but the version given here remains attractive because it is simple to implement, and does 
well in practice. Since each column/row update costs 0(n 3 ), when only a few, fixed number K of sweeps through all 
columns is done, the cost is 0(Kn 4 ). 

5 Numerical Results 

Recovering structure. To illustrate our method we consider a synthetic example constructed by adding noise to a 
covariance matrix which has a sparse inverse. Precisely, we take a random n x n sparse matrix A, with n — 30, and 
then add a uniform noise of magnitude a = .13 to its inverse to form our matrix E. Our concern here is to check if our 
methods reach a degree of precision sufficient to recover the problem structure, despite the noise masking it. Figure[T] 
shows the original sparse matrix A, the noisy inverse £ _1 and the solution X of problem Q solved using Nesterov's 
algorithm detailed in section|4]with parameter p = a. 
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Original inverse A 



Noisy inverse E~ 



Solution for p = a 



Figure 1 : Recovering the sparsity pattern. We plot the original inverse covariance matrix A, the noisy inverse E 1 
and the solution to problem Q for p = .13. 



Of course, the simple example above assumes that we set p equal to the noise level a, so we look at what happens 
when the value of p is set above or below a. In Figure|5]on the left, we solve problem Q for various values of p. We 
plot the minimum min-fXy : Aij 7^ 0} (solid) and mean (dash-dot) magnitude of those coefficients in the solution 
X corresponding to non zero coefficients in A (solid line) against the maximum max{Jy : A^ — 0} (dashed line) 
and mean (dotted line) magnitude of coefficients in X corresponding to zeros in A (we only consider off-diagonal 
coefficients). For all values of p within the interval V shown in Figure|2j the minimum is larger than the maximum, 
meaning that for all thresholding levels between the minimum and maximum, we exactly recover the original matrix 
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A. As we can observe, the range V of values of p for which this happens is fairly large and so is the gap between the 
minimum and maximum within the interval V. 

In Figure |2 on the right, for various values of p, we randomly sample 10 noisy matrices £ with n — 50 and 
a = .1 and compute the number of misclassified zeros and nonzero elements in the solution to Q produced by the 
block coordinate descent method. We plot the average percentage of errors (number of elements incorrectly set to zero 
plus number of elements incorrectly kept nonzero, divided by n 2 ), as well as error bars corresponding to one standard 
deviation. 




Large-scale problems. We now compare Nesterov's and coordinate descent algorithms on a set of randomly gen- 
erated examples. The noisy matrix £ is generated as above and we plot in Figure[3]CPU time against problem size 
for various n and duality gap versus CPU time in seconds for a random problem of size 100 with a few nonzero 
coefficients. In practice, a low-precision solution to Q is sufficient to identify most of the nonzero coefficients in the 
original matrix A. In Figure@]we show the classification error made by the solution on the example with n = 100. 
Typical computing time for a problem with n = 300 is about 20 minutes. (All CPU times computed on a 1.5Ghz 
PowerBook G4 laptop). 
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Problem Size n 



CPU Time (in seconds) 



Figure 3: Computing time. Left: we plot CPU time (in seconds) to reach gap of e = 1 versus problem size n 
on random problems, solved using Nesterov's method (stars) and the coordinate descent algorithm (circles). Right: 
convergence plot a random problem with n = 100, this time comparing Nesterov's method where e = 5 (solid line A) 
and e = 1 (solid line B) with one sweep of the coordinate descent method (dashed line C). 
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Figure 4: Classification Error. Left: we plot the coefficient magnitudes for the original matrix A (solid line) and the 
solution (dashed line) in decreasing order. The dotted line is at the signal to noise level a. Right: we plot the magnitude 
of those coefficients in the solution associated with nonzero elements in A (solid line), ranked in increasing magnitude, 
together with the magnitude of those coefficients in the solution associated with zero elements in A (dashed line), in 
decreasing order of magnitude. Again, the dotted line is at the signal to noise level a and we only consider off-diagonal 
elements. 



